Start up

yufan_yin_week0: 9.9. - 15.9.2020

See also my course diary: https://yufanyin.github.io/datavis-R/

and the repository: https://github.com/yufanyin/datavis-R

.1 Describe my dataset

Structure of the data

learning2019 <- read.csv(file = "D:/Users/yinyf/datavis-R/week0/learning2019.csv", stringsAsFactors = TRUE) 
str(learning2019)
## 'data.frame':    218 obs. of  17 variables:
##  $ 锘縞luster     : int  3 2 1 1 3 1 2 2 1 3 ...
##  $ unref          : num  4 2 3 2 3 ...
##  $ deep           : num  3.5 4.25 3.75 4.25 3.25 3.5 4.25 4.25 4 4 ...
##  $ orga           : num  3.33 3 4.33 3.67 2.67 ...
##  $ blocks         : num  3.33 3.67 3.67 3 3.67 ...
##  $ procrastination: num  3.25 4.25 3.75 2.5 4.25 3.5 3.5 4.25 3.25 2.5 ...
##  $ perfectionism  : num  3.67 3.33 3.33 2.67 2.33 ...
##  $ innateability  : num  1 1.5 3 1.5 2.5 2 2 1 2.5 1 ...
##  $ ktransforming  : num  4 3.67 3.67 3.33 4 ...
##  $ productivity   : num  1.25 2 1.25 2.25 2.25 2.5 3 2.25 2.25 3.75 ...
##  $ gender         : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ studentstatus  : int  1 1 1 1 1 1 1 1 1 3 ...
##  $ studylength    : int  39 51 3 3 15 3 3 3 3 3 ...
##  $ writingcourse  : int  2 3 4 0 0 11 0 0 44 35 ...
##  $ monthsamel     : int  2 2 NA 0 NA 2 4 NA 3 2 ...
##  $ no             : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ faculty        : int  2 8 5 9 2 6 4 4 4 9 ...
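The odd first column name (‘锘縞luster’) is what a UTF-8 byte-order mark (BOM) looks like when the file is read with a different encoding; the BOM also swallows the first letter of the real name, cluster. A minimal sketch of one way to avoid this, assuming the CSV was saved as UTF-8 with a BOM. The temporary file and the object demo are made up for illustration:

```r
# Write a small CSV that starts with a UTF-8 byte-order mark (EF BB BF),
# mimicking what some spreadsheet programs produce. The file is hypothetical.
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("cluster,unref\n3,4\n2,2\n"), con)
close(con)

# fileEncoding = "UTF-8-BOM" asks R to drop the BOM while reading
demo <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
names(demo)
```

With the real file, the same fileEncoding argument could be added to the read.csv() call, after which the first column should simply be called cluster.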

The aim of the study is to investigate the interrelationships between approaches to learning and conceptions of academic writing among international university students. Altogether, 218 international students of the university participated in the study in 2018 and 2019. The students were divided into homogeneous groups (profiles) based on their Z scores on the three approaches to learning. The means of the profiles were then compared using ANOVA.

The data ‘learning2019’ consist of 218 observations and 17 variables. They contain the students’ scores on approaches to learning (different ways in which students process information: unreflective studying, deep approach to learning and organised studying) and on conceptions of academic writing (blocks, procrastination, perfectionism, innate ability, knowledge transforming and productivity), as well as some background information (categorical variables, e.g. gender, age, faculty, student status and study length).
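The profiling described above (Z scores, grouping, ANOVA) can be sketched roughly as follows. This is only an illustration under assumptions: the data are simulated, and k-means is assumed as the grouping method because the text does not name the method that was actually used.

```r
set.seed(1)
# Simulated scores for 60 hypothetical students (not the real data)
scores <- data.frame(
  unref  = runif(60, 1, 5),
  deep   = runif(60, 1, 5),
  orga   = runif(60, 1, 5),
  blocks = runif(60, 1, 5)
)

# Z scores of the three approaches to learning
z <- scale(scores[, c("unref", "deep", "orga")])

# Group the students into three profiles; k-means is an assumption here
profiles <- kmeans(z, centers = 3, nstart = 10)
scores$cluster <- factor(profiles$cluster)

# One-way ANOVA: does the mean of blocks differ between the profiles?
fit <- aov(blocks ~ cluster, data = scores)
summary(fit)
```

With random data no difference between the profiles is expected; the point is only the order of the steps.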

Some of the columns are explained below. Each score is the average of 2-4 items on a 5-point Likert scale (1 = totally disagree, 5 = fully agree).

  • “unref”: relying on memorisation in the learning process, lacking the reflective approach to studying and applying the fragmented knowledge base.

  • “deep”: comprehending the intentional content, using evidence and integrating with previous knowledge.

  • “orga”: time management, study organisation, effort management and concentration.

  • “blocks”: the inability to write productively for reasons other than intellectual capacity or literary skills.

  • “procrastination”: failing to start or postponing tasks like preparing for exams and doing homework.

  • “perfectionism”: setting overly high standards, pursuing flawlessness, and evaluating one’s behavior critically.

  • “innateability”: writing is a skill which “is determined at birth” or “cannot be taught or developed”.

  • “ktransforming”: (knowledge transforming) using writing to develop knowledge and generate new ideas, and as part of reflective and dialectic processes.

  • “productivity”: (sense of productivity) part of self-efficacy in writing.

.2 My previous experience in R

  • I understand the basics of data wrangling.

  • I have used R for analyses such as clustering and classification, though not very proficiently.

I attended the course “Introduction to Open Data Science” (HYMY-909, 5 cr) last autumn. Here is the link to my GitHub repository:

https://github.com/yufanyin/IODS-project

and my course diary:

https://yufanyin.github.io/IODS-project/

.3 Expectations for this course

  • To learn practical data visualization skills using R and the ggplot2 library. I know little about data visualization.

  • To learn what makes a good data visualization and to avoid bad or incorrect practices.

  • To produce rich, accurate and concise visualizations of my own data. I have found a suitable method for analysing my data and carried out the analysis in SPSS. This course can help me produce better visualizations, which will benefit me a lot when I submit my FIRST article at the end of this year.


Week 1 Exercises

yufan_yin_week1: 16.9. - 21.9.2020

See also my course diary: https://yufanyin.github.io/datavis-R/

(Adding this link is a habit from another course.)

Exercise 1

# Create a vector named my_vector. It should have 7 numeric elements.
my_vector <- c(20, 14, 18, 14, 10, 16, 16)

# Print your vector
my_vector
## [1] 20 14 18 14 10 16 16
# Calculate the minimum, maximum, and median values of your vector
summary(my_vector)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.00   14.00   16.00   15.43   17.00   20.00
# Print "The median value is XX"
median_exercise1 <- median(my_vector) # Output from functions can be saved to objects
paste("The median value is", median_exercise1) # Use the paste() function to print the object with text
## [1] "The median value is 16"

Exercise 2

# Create another vector named my_vector_2. It should have the elements of my_vector divided by 2.
my_vector_2 <- my_vector / 2 # Arithmetic on a vector is applied to each element
my_vector_2
## [1] 10  7  9  7  5  8  8
# Create a vector named my_words. It should have 7 character elements.
my_words <- c("swan", "goose", "mallard", "blue_tit", "philomelos", "sparrow", "gull")

# Combine my_vector and my_words into a data frame.
df <- data.frame(my_vector, my_words)
df
##   my_vector   my_words
## 1        20       swan
## 2        14      goose
## 3        18    mallard
## 4        14   blue_tit
## 5        10 philomelos
## 6        16    sparrow
## 7        16       gull
# Show the structure of the data frame.
str(df)
## 'data.frame':    7 obs. of  2 variables:
##  $ my_vector: num  20 14 18 14 10 16 16
##  $ my_words : chr  "swan" "goose" "mallard" "blue_tit" ...

Exercise 3

library(tidyverse)
## -- Attaching packages ----------------------------------------------------------------------- tidyverse 1.3.0 --
## √ ggplot2 3.3.2     √ purrr   0.3.4
## √ tibble  3.0.3     √ dplyr   1.0.2
## √ tidyr   1.1.2     √ stringr 1.4.0
## √ readr   1.3.1     √ forcats 0.5.0
## -- Conflicts -------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
# Use the head() function to print the first 3 rows of your data frame.
head(df, 3) # the second argument sets the number of rows (the default is 6)
##   my_vector my_words
## 1        20     swan
## 2        14    goose
## 3        18  mallard
# Create a new variable to the data frame which has the values of my_vector_2 (remember to save the new variable to the data frame object).
pair <- c(my_vector_2)
pair
## [1] 10  7  9  7  5  8  8
df2 <- data.frame(df,pair)
df2
##   my_vector   my_words pair
## 1        20       swan   10
## 2        14      goose    7
## 3        18    mallard    9
## 4        14   blue_tit    7
## 5        10 philomelos    5
## 6        16    sparrow    8
## 7        16       gull    8
# Use filter() to print rows of your data frame greater than the median value of my_vector.
df2 %>% filter(my_vector > median(my_vector)) # compare the column (not the whole data frame) with the median
##   my_vector my_words pair
## 1        20     swan   10
## 2        18  mallard    9

Week 2 Exercises

yufan_yin_week2: 23.9. - 28.9.2020

See also my course diary: https://yufanyin.github.io/datavis-R/

Exercise 1

1.1 Loading libraries and suppressing any output messages in the chunk settings

Create a new code chunk where you load the tidyverse package. In the chunk settings, suppress any output messages.

1.2 Reading the data

The tibble df has 60 observations (rows) of variables (columns) group, gender, age, score1 and score2 (continuous scores from two tests). Each row represents one participant.

df
## # A tibble: 60 x 4
##    group gender score1 score2          
##    <int> <chr>   <dbl> <chr>           
##  1     2 F        18.7 14.7563711082321
##  2     1 M        20.1 15.1463059324341
##  3     2 F        17.4 19.0025387614538
##  4     1 M        18.7 15.5693261509451
##  5     2 F        18.5 16.7322250273729
##  6     1 999      16.9 16.4511010915052
##  7     2 M        20.4 15.1008590050657
##  8     1 F        20.3 15.191041952879 
##  9     1 F        19.4 13.9717194882152
## 10     2 M        21.2 22.6918520246433
## # ... with 50 more rows

There is something to fix in three of the variables. Explore the data and describe what needs to be corrected.

Hint: You can use e.g. str(), distinct(), and summary() to explore the data.

str(df)
## tibble [60 x 4] (S3: tbl_df/tbl/data.frame)
##  $ group : int [1:60] 2 1 2 1 2 1 2 1 1 2 ...
##  $ gender: chr [1:60] "F" "M" "F" "M" ...
##  $ score1: num [1:60] 18.7 20.1 17.4 18.7 18.5 ...
##  $ score2: chr [1:60] "14.7563711082321" "15.1463059324341" "19.0025387614538" "15.5693261509451" ...
summary(df)
##      group        gender              score1         score2         
##  Min.   :1.0   Length:60          Min.   :14.17   Length:60         
##  1st Qu.:1.0   Class :character   1st Qu.:16.85   Class :character  
##  Median :1.5   Mode  :character   Median :17.61   Mode  :character  
##  Mean   :1.5                      Mean   :17.89                     
##  3rd Qu.:2.0                      3rd Qu.:19.01                     
##  Max.   :2.0                      Max.   :21.53
distinct(df)
## # A tibble: 60 x 4
##    group gender score1 score2          
##    <int> <chr>   <dbl> <chr>           
##  1     2 F        18.7 14.7563711082321
##  2     1 M        20.1 15.1463059324341
##  3     2 F        17.4 19.0025387614538
##  4     1 M        18.7 15.5693261509451
##  5     2 F        18.5 16.7322250273729
##  6     1 999      16.9 16.4511010915052
##  7     2 M        20.4 15.1008590050657
##  8     1 F        20.3 15.191041952879 
##  9     1 F        19.4 13.9717194882152
## 10     2 M        21.2 22.6918520246433
## # ... with 50 more rows
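distinct() on the whole tibble only removes duplicate rows, so it prints the same 60 rows again. Looking at one column at a time makes odd values easier to spot. A small sketch on made-up data (toy and its values are hypothetical, not the course data):

```r
library(dplyr)

# Made-up data with the same kinds of problems as df
toy <- tibble(
  group  = c(1L, 2L, 1L, 2L),
  gender = c("F", "M", "999", "F"),
  score2 = c("14.7", "15.1", "19.0", "15.5")
)

# Unique values of a single column reveal the stray code "999"
distinct(toy, gender)

# count() additionally shows how often each value occurs
gender_counts <- count(toy, gender)
gender_counts
```

The same two calls on df would list every value that occurs in gender, including 999.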

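The dataset df consists of 60 observations and 4 variables: group, gender, score1 and score2 (age is not in the data that was read). Two problems are visible in the output: gender contains the value 999, which looks like a code for a missing value, and score2 is stored as character although it holds numbers. In addition, group is an integer code for a grouping variable and could be converted to a factor.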

Exercise 2

2.1 Tidying data

Make the corrections you described above.

df <- df %>%
  mutate(gender = na_if(gender, "999"), # recode "999" to NA (missing); gender is a character column
         score2 = as.numeric(score2))   # convert a character vector to a numeric vector

2.2 Counting observations by grouping variables

Count observations by group and gender. Arrange by the number of observations (ascending).

df %>%
  count(group, gender) %>% # count() is a combination of group_by() and tally()
  arrange(group, desc(n)) # within each group, largest n first; arrange(n) alone would sort ascending by n
## # A tibble: 6 x 3
##   group gender     n
##   <int> <chr>  <int>
## 1     1 M         14
## 2     1 F         13
## 3     1 <NA>       3
## 4     2 F         15
## 5     2 M         14
## 6     2 <NA>       1

Exercise 3

3.1 Creating a new variable: the difference between scores

Create a new variable, score_diff, that contains the difference between score1 and score2.

df$score_diff <- df$score1 - df$score2

3.2 Computing the means: using summarise() to take multiple variables in one go

Compute the means of score1, score2, and score_diff.

Hint: Like mutate(), summarise() can take multiple variables in one go.

df %>%
  summarise(score1_mean = mean(score1), score2_mean = mean(score2), score_diff_mean = mean(score_diff))
## # A tibble: 1 x 3
##   score1_mean score2_mean score_diff_mean
##         <dbl>       <dbl>           <dbl>
## 1        17.9        16.1            1.82

3.3 Computing the means by grouping variable

Compute the means of score1, score2, and score_diff by gender.

grouped_df <- df %>%
  group_by(gender)

grouped_df %>%
  summarise(score1_mean = mean(df$score1), score2_mean = mean(df$score2), score_diff_mean = mean(df$score_diff))
## # A tibble: 3 x 4
##   gender score1_mean score2_mean score_diff_mean
##   <chr>        <dbl>       <dbl>           <dbl>
## 1 F             17.9        16.1            1.82
## 2 M             17.9        16.1            1.82
## 3 <NA>          17.9        16.1            1.82
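  # The rows are identical because df$score1 is the whole column of df and ignores the grouping;
  # inside summarise(), bare names such as mean(score1) are evaluated within each group.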
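The effect of referring to a column with $ inside a grouped summarise() can be reproduced on a tiny example. This is a sketch on made-up data (toy, buggy and fixed are hypothetical names):

```r
library(dplyr)

toy <- tibble(
  gender = c("F", "F", "M", "M"),
  score1 = c(10, 20, 30, 50)
)

# Buggy: toy$score1 is the full column, so every group gets the overall mean
buggy <- toy %>%
  group_by(gender) %>%
  summarise(score1_mean = mean(toy$score1))

# Fixed: the bare name score1 is evaluated within each group
fixed <- toy %>%
  group_by(gender) %>%
  summarise(score1_mean = mean(score1))

buggy  # both rows show 27.5
fixed  # F: 15, M: 40
```

Applied to the exercise, replacing df$score1, df$score2 and df$score_diff with score1, score2 and score_diff should give a different mean for each gender.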

Exercise 4

4.1 Creating an x-y scatter plot

Using ggplot2, create a scatter plot with score1 on the x-axis and score2 on the y-axis.

df %>%
  ggplot(aes(score1, score2)) + # x = score1, y = score2
  geom_point()

4.2 Setting colour based on grouping variable, figure width and height

Continuing with the previous plot, colour the points based on gender.

Set the output figure width to 10 and height to 6.

df %>%
  ggplot(aes(score1, score2, color = gender)) + # x = score1, y = score2
  geom_point()

Exercise 5

Note: I did this part in another rmd file named ‘index’.

see: https://github.com/yufanyin/datavis-R/blob/master/index.Rmd

5.1 Metadata section

Add the author (your name) and date into the metadata section. Create a table of contents.

5.2 Knitting

Knit your document to HTML by changing html_notebook to html_document in the metadata, and pressing Knit.

See the results in my course diary: https://yufanyin.github.io/datavis-R/


Week 3 Exercises

yufan_yin_week3: 29.9. - 5.10.2020

See also my course diary: https://yufanyin.github.io/datavis-R/

library(tidyverse)

Exercise 1

1.1 Reading the data

Read the data into R. It has 206 observations of 17 variables.

learning2019 <- read.csv(file = "D:/Users/yinyf/datavis-R/week0/learning2019_week3.csv", stringsAsFactors = TRUE)
learning19 <- learning2019 %>%
  mutate(studylength = as.numeric(studylength),
         writingcourse = as.numeric(writingcourse))

str(learning19)
## 'data.frame':    206 obs. of  17 variables:
##  $ 锘縩o          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ cluster        : int  3 2 1 1 3 1 2 2 1 3 ...
##  $ unref          : num  4 2 3 2 3 2.67 1 2.33 3 3.67 ...
##  $ deep           : num  3.5 4.25 3.75 4.25 3.25 3.5 4.25 4.25 4 4 ...
##  $ orga           : num  3.33 3 4.33 3.67 2.67 4 2.33 3.33 4 3.67 ...
##  $ blocks         : num  3.33 3.67 3.67 3 3.67 4 2.67 2.33 3.33 2.67 ...
##  $ procrastination: num  3.25 4.25 3.75 2.5 4.25 3.5 3.5 4.25 3.25 2.5 ...
##  $ perfectionism  : num  3.67 3.33 3.33 2.67 2.33 2.33 4 2.67 3 3.33 ...
##  $ innateability  : num  1 1.5 3 1.5 2.5 2 2 1 2.5 1 ...
##  $ ktransforming  : num  4 3.67 3.67 3.33 4 3.33 2 4.33 4 4.33 ...
##  $ productivity   : num  1.25 2 1.25 2.25 2.25 2.5 3 2.25 2.25 3.75 ...
##  $ gender         : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ studentstatus  : int  1 1 1 1 1 1 1 1 1 2 ...
##  $ studylength    : num  39 51 3 3 15 3 3 3 3 3 ...
##  $ writingcourse  : num  2 3 4 0 0 11 0 0 44 35 ...
##  $ monthsamel     : int  2 2 NA 0 NA 2 4 NA 3 2 ...
##  $ faculty        : int  2 8 5 9 2 6 4 4 4 9 ...

1.2 Creating categorical variable

For my data, studylength is more suitable than age as the variable to cut into categories. It describes how many months the students have studied at the university.

Cut the continuous variable studylength into a categorical variable studylength_group. Use ggplot2’s cutting function: cut_number() makes n groups with (approximately) equal number of observations.

Count observations by studylength group.

library(ggplot2)
learning19 %>%
  mutate(score_group_test = cut_width(studylength, 12, boundary = 0)) %>% # bins of width 12 (months), starting at 0
  count(score_group_test)
##   score_group_test   n
## 1           [0,12] 102
## 2          (12,24]  47
## 3          (24,36]  19
## 4          (36,48]  14
## 5          (48,60]  16
## 6          (60,72]   5
## 7          (72,84]   2
## 8        (168,180]   1
library(ggplot2)
learning19 %>%
  mutate(studylength_group = cut_number(studylength, 3)) %>% # each group has about 206 / 3 = 68 observations
  count(studylength_group)
##   studylength_group  n
## 1             [2,7] 71
## 2            (7,17] 67
## 3          (17,172] 68

Save the results with labels to the data.

learning19 <- learning19 %>%
  mutate(studylength_group = cut_number(studylength, 3,
                                 labels = c('-7','8-17','18-')))
learning19 %>% 
  distinct(studylength_group)
##   studylength_group
## 1               18-
## 2                -7
## 3              8-17
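The three cutting functions in ggplot2 differ in what they keep equal. A short sketch on made-up values (x is hypothetical) to compare them side by side:

```r
library(ggplot2)

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)  # made-up values with one outlier

# cut_width(): bins of a fixed width (here 25), starting at 0
by_width <- table(cut_width(x, width = 25, boundary = 0))

# cut_interval(): n bins of equal range, each (max - min) / n wide
by_interval <- table(cut_interval(x, n = 2))

# cut_number(): n bins with (approximately) equal numbers of observations
by_number <- table(cut_number(x, n = 2))

by_width
by_interval
by_number
```

With skewed data such as studylength, cut_number() keeps the groups balanced, while cut_width() and cut_interval() leave most observations in the first bin.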

Exercise 2

The chunk below is supposed to produce a plot but it has some errors.

The figure should be a scatter plot of cluster (different student profiles) on the x-axis and blocks on the y-axis, with points coloured by studylength_group (3 levels). It should also have three linear regression lines, one for each of the study-length groups.

Fix the code to produce the right figure.

What happens if you use geom_jitter() instead of geom_point()?

Hint: Examine the code bit by bit: start by plotting just the scatter plot without geom_smooth(), and add the regression lines last.

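learning19 %>% 
  ggplot(aes(cluster, blocks, color = studylength_group)) + # color (not fill) applies to points and lines
  geom_point() + # scatter plot instead of geom_col()
  geom_smooth(method = "lm") # one regression line per studylength_group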
## `geom_smooth()` using formula 'y ~ x'

learning19 %>%
  ggplot(aes(cluster, blocks)) + 
  geom_col() +
  facet_wrap(~studylength_group)
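On the geom_jitter() question: cluster only takes the values 1, 2 and 3, so with geom_point() many points sit on top of each other. geom_jitter() adds a little random noise to the positions so that overlapping points become visible. A sketch with made-up data (toy is hypothetical):

```r
library(ggplot2)

set.seed(1)
toy <- data.frame(
  cluster = rep(1:3, each = 30),
  blocks  = sample(seq(1, 5, by = 1/3), 90, replace = TRUE),
  studylength_group = factor(rep(c("-7", "8-17", "18-"), times = 30))
)

p <- ggplot(toy, aes(cluster, blocks, color = studylength_group)) +
  # jitter horizontally only, so the blocks scores stay exact
  geom_jitter(width = 0.15, height = 0) +
  # one regression line per group, because color defines the groups
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
p
```

Setting height = 0 is a design choice: only the x positions are shifted, so the plotted scores are not distorted.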

Exercise 3

3.1

Calculate the mean, standard deviation (sd), and number of observations (n) of score on blocks by student profiles and study-length group. Also calculate the standard error of the mean (by using sd and n). Save these into a new data frame (or tibble) named cluster_blocks_stats.

cluster_blocks_stats <- learning19 %>%
  group_by(cluster, studylength_group, .drop = FALSE) %>% # some combinations may have no observations, but we don't drop them
  summarise(mean_blocks = mean(blocks),
            sd_blocks = sd(blocks),
            n = n()) %>%
  ungroup()
## `summarise()` regrouping output by 'cluster' (override with `.groups` argument)
cluster_blocks_stats
## # A tibble: 9 x 5
##   cluster studylength_group mean_blocks sd_blocks     n
##     <int> <fct>                   <dbl>     <dbl> <int>
## 1       1 -7                       2.48     0.981    31
## 2       1 8-17                     2.49     0.870    37
## 3       1 18-                      2.38     0.685    26
## 4       2 -7                       2.85     0.922    27
## 5       2 8-17                     2.53     0.775    22
## 6       2 18-                      2.59     0.936    27
## 7       3 -7                       3.44     0.906    13
## 8       3 8-17                     2.88     1.15      8
## 9       3 18-                      3.04     0.845    15
learning19 %>%
  ggplot(aes(cluster, blocks)) + 
  geom_col() +
  facet_wrap(~studylength_group)
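The task also asks for the standard error of the mean, which is the standard deviation divided by the square root of the number of observations. A sketch on made-up data (toy and stats are hypothetical names) showing how the extra column could be added:

```r
library(dplyr)

toy <- tibble(
  cluster = c(1, 1, 1, 1, 2, 2, 2, 2),
  blocks  = c(2, 4, 2, 4, 1, 3, 3, 5)
)

stats <- toy %>%
  group_by(cluster) %>%
  summarise(mean_blocks = mean(blocks),
            sd_blocks   = sd(blocks),
            n           = n(),
            .groups = "drop") %>%
  mutate(se_blocks = sd_blocks / sqrt(n))  # standard error of the mean
stats
```

The same mutate() step could be appended to the pipeline that creates cluster_blocks_stats.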

3.2

Using cluster_blocks_stats, plot a bar plot that has cluster on the x-axis, mean score of blocks on the y-axis, and studylength levels in subplots (facets).

Use geom_errorbar() to add error bars that represent standard errors of the mean.

learning19 %>%
  ggplot(aes(cluster, blocks)) + 
  geom_bar(stat = "summary", fun = "mean") + # bar height = mean of blocks
  stat_summary(geom = "errorbar", fun.data = "mean_se", width = 0.3) + # mean +/- standard error
  facet_wrap(~studylength_group)

Exercise 4

4.1

Create a figure that has boxplots of cluster (x-axis) by blocks (y-axis).

Note: cluster is read as an integer, so it has to be converted to a factor (here with factor() inside aes()) to get one box per profile. ‘Ord.factor’ in str() output denotes an ordered factor, i.e. a factor whose levels have a defined order.

learning19 %>%
  ggplot(aes(factor(cluster), blocks)) + 
  geom_boxplot() +
  facet_wrap(~studylength_group)

4.2

Group the data by cluster and add mean score of blocks by cluster to a new column mean_score. Do this with mutate() (not summarise()).

Reorder the levels of cluster based on mean_score.

Hint: Remember to ungroup after creating the mean_score variable.

Note: Maybe the types of the variables in my data are not suitable for these operations.
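The steps of exercise 4.2 can be tried on a numeric cluster code once it is converted to a factor. A sketch on made-up data (toy and toy_ordered are hypothetical names), using fct_reorder() from forcats:

```r
library(dplyr)
library(forcats)

toy <- tibble(
  cluster = c(1, 1, 2, 2, 3, 3),
  blocks  = c(3.0, 3.4, 2.0, 2.2, 2.6, 2.8)
)

toy_ordered <- toy %>%
  group_by(cluster) %>%
  mutate(mean_score = mean(blocks)) %>%   # group mean repeated on every row
  ungroup() %>%
  # convert the numeric code to a factor, then order its levels by mean_score
  mutate(cluster = fct_reorder(factor(cluster), mean_score))

levels(toy_ordered$cluster)  # "2" "3" "1": lowest mean first
```

Because mutate() is used instead of summarise(), every original row is kept and only the order of the factor levels changes.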

Exercise 5

Using the data you modified in exercise 4.2, plot mean scores (x-axis) by cluster (y-axis) as points. The clusters should be ordered by mean score.

Use stat_summary() to add error bars that represent standard errors of the mean.

Hint: Be careful which variable - mean_score or score - you’re plotting in each of the geoms.

Note: Maybe the variables in my data are not suitable for this operation.
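A plot of this kind can be drawn once cluster is a factor ordered by its mean score. A sketch on simulated data (toy is hypothetical), using base R's reorder() for the ordering and coord_flip() to put the scores on the x-axis:

```r
library(ggplot2)

set.seed(1)
toy <- data.frame(
  cluster = factor(rep(c("1", "2", "3"), each = 20)),
  blocks  = c(rnorm(20, 2.5, 0.8), rnorm(20, 2.7, 0.8), rnorm(20, 3.1, 0.8))
)
# order the clusters by their mean blocks score
toy$cluster <- reorder(toy$cluster, toy$blocks, FUN = mean)

p <- ggplot(toy, aes(cluster, blocks)) +
  # mean of blocks per cluster as a point
  stat_summary(fun = mean, geom = "point") +
  # mean_se() gives mean - SE and mean + SE for the error bars
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.2) +
  # flip the axes: mean scores on the x-axis, clusters on the y-axis
  coord_flip()
p
```

Both layers summarise the raw blocks scores, so no separate mean_score column is needed for the plot itself.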


Week 4 Exercises

yufan_yin_week4: 6.10. - 12.10.2020

See also my course diary: https://yufanyin.github.io/datavis-R/

Exercise 1

1.1 Reading the data

Read the region_scores.csv data

region_scores <- read.csv(file = "D:/Users/yinyf/datavis-R/week4/region_scores.csv", stringsAsFactors = TRUE)
region_scores <- region_scores %>%
  mutate(id = as.character(id),
         region = factor(region),
         education = factor(education, ordered = TRUE),
         gender = factor(gender))

glimpse(region_scores)
## Rows: 240
## Columns: 6
## $ id        <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", ...
## $ region    <fct> South Karelia, Satakunta, Kymenlaakso, South Karelia, Sou...
## $ education <ord> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ gender    <fct> M, F, M, F, F, F, M, F, M, M, F, M, F, M, M, M, F, M, F, ...
## $ age       <int> 56, 41, 48, 41, 35, 60, 28, 28, 48, 51, 45, 55, 41, 24, 6...
## $ score     <dbl> 4.268811, 5.646586, 6.949019, 7.096777, 6.990985, 5.26766...

Cutting values (score) into intervals

to groups of width 10

region_scores %>%
  mutate(score_group = cut_width(score, 10, boundary = 0)) %>% 
  count(score_group)
##   score_group   n
## 1      [0,10]  55
## 2     (10,20] 154
## 3     (20,30]  31
region_scores <- region_scores %>%
  mutate(score_group = cut_width(score, 10, boundary = 0, 
                                 labels = c('-10','11-20','21-'))) 
region_scores %>% 
  distinct(score_group)
##   score_group
## 1         -10
## 2       11-20
## 3         21-

Column score_group is not found.

region_scores2 <- region_scores %>%
  group_by(education, score_group, .drop = FALSE) %>%
  summarise(mean_age = mean(age),
            sd_age = sd(age),
            n = n()) %>%
  ungroup()
## `summarise()` regrouping output by 'education' (override with `.groups` argument)
region_scores2
## # A tibble: 9 x 5
##   education score_group mean_age sd_age     n
##   <fct>     <fct>          <dbl>  <dbl> <int>
## 1 1         -10             39.5  10.1     46
## 2 1         11-20           38.8  10.2     39
## 3 1         21-            NaN    NA        0
## 4 2         -10             45     9.27     9
## 5 2         11-20           42.2   9.61    65
## 6 2         21-             39.3   7.57     3
## 7 3         -10            NaN    NA        0
## 8 3         11-20           40.1  10.4     50
## 9 3         21-             37.4   8.97    28

1.2 Histograms

Create a figure that shows the distributions (density plots or histograms) of age and score in separate subplots (facets). What do you need to do first?

Note: I’m not sure which grouping variable to use for creating the subplots.

In the figure, set individual x-axis limits for age and score by modifying the scales parameter within facet_wrap().

Question: What went wrong when I used facet_wrap() and got the warning ‘Layer 1 is missing score_group (or another grouping variable)’? I had saved score_group. I ran into the same problem last week.

region_scores %>%
  ggplot(aes(age, fill = score_group)) + 
  geom_histogram(position = "identity", alpha = .5, binwidth = 1) 

(More variants, kept as a reminder for the future.)

region_scores %>%
  ggplot(aes(age, fill = gender)) + 
  geom_histogram(position = "identity", alpha = .5, binwidth = 1) 

region_scores %>%
  ggplot(aes(score, fill = gender)) + 
  geom_histogram(position = "identity", alpha = .5, binwidth = 1) 
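To show age and score in separate facets, the two columns first have to be gathered into long format, so that facet_wrap() can split on the variable name. A sketch on simulated data (toy and toy_long are hypothetical); scales = "free_x" gives each facet its own x-axis limits:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(1)
toy <- tibble(
  id    = 1:100,
  age   = sample(20:65, 100, replace = TRUE),
  score = rnorm(100, mean = 13, sd = 5)
)

# one row per participant and variable: columns id, var, value
toy_long <- toy %>%
  pivot_longer(c(age, score), names_to = "var", values_to = "value")

p <- ggplot(toy_long, aes(value)) +
  geom_histogram(bins = 20) +
  facet_wrap(~var, scales = "free_x")   # separate x-axis limits per facet
p
```

Here the facetting variable is var (the former column names) rather than a grouping variable such as score_group.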

1.3 Density plots

Note: I do not understand the meaning of y-axis in such density plots.

region_scores %>%
  ggplot(aes(age, fill = gender)) + 
  geom_density(alpha = .5) 

region_scores %>%
  ggplot(aes(score, fill = gender)) + 
  geom_density(alpha = .5) 
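On the y-axis question: a density curve is scaled so that the total area under it equals 1. The height at a given value is therefore neither a count nor a probability by itself; the area over an interval approximates the share of observations in that interval. A base R sketch with simulated ages (made up) that checks the area numerically:

```r
set.seed(1)
age <- rnorm(240, mean = 42, sd = 10)  # simulated, not the real data

d <- density(age)          # kernel density estimate, similar to what geom_density() draws
step <- diff(d$x)[1]       # spacing of the evaluation grid
area <- sum(d$y) * step    # Riemann-sum approximation of the area under the curve
area                       # should be close to 1
```

This also explains why the y values are small for age: the curve is spread over several decades of x, so it must be low for the area to stay at 1.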

Exercise 2

In this exercise, you will use the built-in iris dataset.

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

2.1 Make data into long format

Make the data into long format: gather all variables except species into new variables var (variable names) and measure (numerical values). You should end up with 600 rows and 3 columns (Species, var, and measure). Assign the result into iris_long.

iris_long <- iris %>%
  gather(var, measure, -Species) 
str(iris_long)
## 'data.frame':    600 obs. of  3 variables:
##  $ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ var    : chr  "Sepal.Length" "Sepal.Length" "Sepal.Length" "Sepal.Length" ...
##  $ measure: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...

2.2 Spread: long-to-wide

In iris_long, separate var into two variables: part (Sepal/Petal values) and dim (Length/Width).

Then, spread the measurement values to new columns that get their names from dim. You must create row numbers by dim group before doing this.

You should now have 300 rows of variables Species, part, Length and Width (and row numbers). Assign the result into iris_wide.

Note: This was more complex than the example, and my first attempts failed. A version that should work separates var first and numbers the rows within each dim group before spreading:

iris_wide <- iris_long %>%
  separate(var, into = c("part", "dim"), sep = "\\.") %>% # "Sepal.Length" -> part = "Sepal", dim = "Length"
  group_by(dim) %>%
  mutate(row = row_number()) %>% # row numbers within dim identify each measurement
  ungroup() %>%
  spread(dim, measure) # one column per dim: Length and Width

Without row numbers the values are not uniquely identified. For example, pivot_wider() on iris_long then returns list-columns:

iris_long %>%
  pivot_wider(names_from = c(var),
  values_from = measure) 
## Warning: Values are not uniquely identified; output will contain list-cols.
## * Use `values_fn = list` to suppress this warning.
## * Use `values_fn = length` to identify where the duplicates arise
## * Use `values_fn = {summary_fun}` to summarise duplicates
## # A tibble: 3 x 5
##   Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
##   <fct>      <list>       <list>      <list>       <list>     
## 1 setosa     <dbl [50]>   <dbl [50]>  <dbl [50]>   <dbl [50]> 
## 2 versicolor <dbl [50]>   <dbl [50]>  <dbl [50]>   <dbl [50]> 
## 3 virginica  <dbl [50]>   <dbl [50]>  <dbl [50]>   <dbl [50]>

This is a warning rather than an error, but each cell holds all 50 values of a species, which is not the wanted result.

2.3 Scatter plot

Using iris_wide, plot a scatter plot of length on the x-axis and width on the y-axis. Colour the points by part.

iris_wide %>%
  ggplot(aes(Length, Width, color = part)) + # x = Length, y = Width; color must be mapped inside aes()
  geom_point()

Exercise 3

3.1 Reading my own data

Import your data into R. Check that you have the correct number of rows and columns, column names are in place, the encoding of characters looks OK, etc.

learning2019_w4 <- read.csv(file = "D:/Users/yinyf/datavis-R/week0/learning2019_week4.csv", stringsAsFactors = TRUE)

3.2

Print the structure/glimpse/summary of the data. Outline briefly what kind of variables you have and if there are any missing or abnormal values. Make sure that each variable has the right class (numeric/character/factor etc).

learning_w4 <- learning2019_w4 %>%
  mutate(studylength = as.numeric(studylength),
         writingcourse = as.numeric(writingcourse))
str(learning_w4)
## 'data.frame':    206 obs. of  10 variables:
##  $ 锘縞luster     : int  3 2 1 1 3 1 2 2 1 3 ...
##  $ unref          : num  4 2 3 2 3 2.67 1 2.33 3 3.67 ...
##  $ deep           : num  3.5 4.25 3.75 4.25 3.25 3.5 4.25 4.25 4 4 ...
##  $ orga           : num  3.33 3 4.33 3.67 2.67 4 2.33 3.33 4 3.67 ...
##  $ blocks         : num  3.33 3.67 3.67 3 3.67 4 2.67 2.33 3.33 2.67 ...
##  $ procrastination: num  3.25 4.25 3.75 2.5 4.25 3.5 3.5 4.25 3.25 2.5 ...
##  $ gender         : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ studentstatus  : int  1 1 1 1 1 1 1 1 1 2 ...
##  $ studylength    : num  39 51 3 3 15 3 3 3 3 3 ...
##  $ writingcourse  : num  2 3 4 0 0 11 0 0 44 35 ...

Exercise 4 Counting observations by grouping variables

Pick a few (2-5) variables of interest from your data (ideally, both categorical and numerical).

For categorical variables, count the observations in each category (or combination of categories). Are the frequencies balanced?

learning19_w4 %>%
  count(cluster, gender) %>%
  arrange(desc(n)) %>%
  arrange(cluster)

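Error: Must group by variables found in .data. * Column cluster is not found. Neither is learning19_w4[1]. The cause is the column name: the first column is called ‘锘縞luster’, not ‘cluster’, because the byte-order mark at the start of the CSV file was read as part of the name. Well… I’m not very angry.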

For numerical variables, compute some summary statistics (e.g. min, max, mean, median, SD) over the whole dataset or for subgroups. What can you say about the distributions of these variables, or possible group-wise differences?

Overall:

summary(learning_w4)
##    锘縞luster        unref            deep            orga      
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.670   1st Qu.:3.750   1st Qu.:2.670  
##  Median :2.000   Median :2.000   Median :4.000   Median :3.330  
##  Mean   :1.718   Mean   :2.178   Mean   :4.007   Mean   :3.411  
##  3rd Qu.:2.000   3rd Qu.:2.670   3rd Qu.:4.500   3rd Qu.:4.000  
##  Max.   :3.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##      blocks      procrastination     gender      studentstatus  
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:2.500   1st Qu.:1.000   1st Qu.:2.000  
##  Median :2.670   Median :3.250   Median :2.000   Median :2.000  
##  Mean   :2.655   Mean   :3.212   Mean   :1.714   Mean   :1.767  
##  3rd Qu.:3.330   3rd Qu.:3.750   3rd Qu.:2.000   3rd Qu.:2.000  
##  Max.   :5.000   Max.   :5.000   Max.   :2.000   Max.   :2.000  
##   studylength     writingcourse   
##  Min.   :  2.00   Min.   : 0.000  
##  1st Qu.:  5.00   1st Qu.: 0.000  
##  Median : 14.00   Median : 3.000  
##  Mean   : 19.75   Mean   : 6.694  
##  3rd Qu.: 28.00   3rd Qu.: 6.000  
##  Max.   :172.00   Max.   :91.000

For subgroups:

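Note: The means below are identical in every subgroup, which cannot be right for groups divided by gender or student status (Bachelor/Master). The cause is that learning_w4$unref refers to the whole column and ignores the grouping; inside summarise(), bare names such as mean(unref) are evaluated within each group. In addition, orga_mean is computed from deep by mistake.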

grouped_df <- learning_w4 %>%
  group_by(studentstatus)

grouped_df %>%
  summarise(unref_mean = mean(learning_w4$unref), deep_mean = mean(learning_w4$deep), orga_mean = mean(learning_w4$deep))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 4
##   studentstatus unref_mean deep_mean orga_mean
##           <int>      <dbl>     <dbl>     <dbl>
## 1             1       2.18      4.01      4.01
## 2             2       2.18      4.01      4.01

We can see that studylength (how many months the students have studied at the university) is a better grouping variable than the numbers in writingcourse. But …

Try cluster (student profile based on the combination of scores on ‘unref’, ‘deep’ and ‘orga’)

learning_w4 %>%
  count(learning_w4[1])
##   cluster  n
## 1       1 94
## 2       2 76
## 3       3 36
grouped_learning <- learning_w4 %>%
  group_by(learning_w4[1])

grouped_learning %>%
  summarise(unref_mean = mean(grouped_learning$unref), deep_mean = mean(grouped_learning$deep), orga_mean = mean(grouped_learning$orga))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 4
##   cluster unref_mean deep_mean orga_mean
##     <int>      <dbl>     <dbl>     <dbl>
## 1       1       2.18      4.01      3.41
## 2       2       2.18      4.01      3.41
## 3       3       2.18      4.01      3.41
# the results look strange but I do not know what went wrong
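
A likely explanation for the identical rows in both summaries: inside summarise(), an expression such as learning_w4$unref or grouped_learning$unref pulls the whole column out of the data frame, so the grouping is bypassed and every group receives the overall mean. Bare column names (unref, deep, orga) are evaluated within each group instead. In the first call, orga_mean is also computed from deep, which is why it shows 4.01. The sketch below illustrates the difference on a small made-up data frame; toy and its values are invented for illustration and are not part of the study data.

```r
library(dplyr)

# Toy data standing in for learning_w4 (values are made up for illustration)
toy <- data.frame(
  studentstatus = c(1, 1, 2, 2),
  unref = c(1, 2, 3, 4),
  deep  = c(5, 4, 3, 2)
)

# toy$unref refers to the full column, so each group gets the overall mean
wrong <- toy %>%
  group_by(studentstatus) %>%
  summarise(unref_mean = mean(toy$unref))

# Bare column names are evaluated separately within each group
right <- toy %>%
  group_by(studentstatus) %>%
  summarise(unref_mean = mean(unref), deep_mean = mean(deep))

wrong
right
```

Applied to the real data, the same pattern would be group_by(studentstatus) followed by summarise(unref_mean = mean(unref), deep_mean = mean(deep), orga_mean = mean(orga)).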

Exercise 5

Describe if there’s anything else you think should be done as “pre-processing” steps (e.g. recoding/grouping values, renaming variables, removing variables or mutating new ones, reshaping the data to long format, merging data frames together).

Do you have an idea of what kind of relationships in your data you would like to visualise and for which variables? For example, would you like to depict variable distributions, the structure of multilevel data, summary statistics (e.g. means), or include model fits or predictions?

5.1 Reading the data

Structure of the data

learning2019 <- read.csv(file = "D:/Users/yinyf/datavis-R/week0/learning2019_w4.csv", stringsAsFactors = TRUE, fileEncoding = "UTF-8-BOM") # fileEncoding strips the byte-order mark that garbles the first column name
learning19 <- learning2019[1:13]
str(learning19)
## 'data.frame':    211 obs. of  13 variables:
##  $ cluster        : int  3 2 1 1 3 1 2 2 1 3 ...
##  $ unref          : num  4 2 3 2 3 ...
##  $ deep           : num  3.5 4.25 3.75 4.25 3.25 3.5 4.25 4.25 4 4 ...
##  $ orga           : num  3.33 3 4.33 3.67 2.67 ...
##  $ blocks         : num  3.33 3.67 3.67 3 3.67 ...
##  $ procrastination: num  3.25 4.25 3.75 2.5 4.25 3.5 3.5 4.25 3.25 2.5 ...
##  $ perfectionism  : num  3.67 3.33 3.33 2.67 2.33 ...
##  $ innateability  : num  1 1.5 3 1.5 2.5 2 2 1 2.5 1 ...
##  $ ktransforming  : num  4 3.67 3.67 3.33 4 ...
##  $ productivity   : num  1.25 2 1.25 2.25 2.25 2.5 3 2.25 2.25 3.75 ...
##  $ gender         : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ studentstatus  : int  1 1 1 1 1 1 1 1 1 3 ...
##  $ studylength    : int  39 51 3 3 15 3 3 3 3 3 ...

The aim of the study is to investigate the interrelationships between the approaches to learning and conceptions of academic writing among international university students. Altogether 218 international students of the university participated in the study in 2018 and 2019. Students were divided into homogeneous groups based on their Z scores on the three approaches to learning. Then we compare mean differences and ANOVA results between the profiles.

The original data ‘learning2019’ consists of 218 observations and 17 variables; the subset ‘learning19’ loaded above keeps 211 observations and the first 13 variables. The data contain the students’ scores on approaches to learning (different ways that students process information: unreflective studying, deep approach to learning and organised studying), conceptions of academic writing (blocks, procrastination, perfectionism, innate ability, knowledge transforming and productivity), and some background information (categorical variables, e.g. gender, age, faculty, student status and study length).

Some of the columns are explained below. Each score is the average of 2-4 questions answered on a 5-point Likert scale (1 = totally disagree, 5 = fully agree).

  • “unref”: relying on memorisation in the learning process, lacking the reflective approach to studying and applying the fragmented knowledge base.

  • “deep”: comprehending the intentional content, using evidence and integrating with previous knowledge.

  • “orga”: time management, study organisation, effort management and concentration.

  • “blocks”: an inability to write productively that is not caused by a lack of intellectual capacity or literary skills.

  • “procrastination”: failing to start or postponing tasks like preparing for exams and doing homework.

  • “perfectionism”: setting overly high standards, pursuing flawlessness, and evaluating one’s behavior critically.

  • “innateability”: writing is a skill which “is determined at birth” or “cannot be taught or developed”.

  • “ktransforming”: (knowledge transforming) using writing to develop knowledge and generate new ideas through reflective and dialectic processes.

  • “productivity”: (sense of productivity) part of self-efficacy in writing.
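
As a minimal sketch of how such a scale score can be built, the example below averages four item responses per student with rowMeans(). The item names (deep_q1 to deep_q4) and the responses are invented for illustration; the actual questionnaire items are not part of this data set, and whether missing answers were skipped (na.rm = TRUE) is an assumption.

```r
# Hypothetical item-level responses on a 1-5 Likert scale
# (item names and values are invented for this illustration)
items <- data.frame(
  deep_q1 = c(4, 3, 5),
  deep_q2 = c(4, 4, 5),
  deep_q3 = c(3, 4, 4),
  deep_q4 = c(3, 5, 3)
)

# Scale score = row-wise mean of the items;
# na.rm = TRUE (an assumption) skips unanswered items
item_cols <- c("deep_q1", "deep_q2", "deep_q3", "deep_q4")
items$deep <- rowMeans(items[, item_cols], na.rm = TRUE)

items$deep
```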

5.2 Exploring the data numerically and graphically

5.2.1 Summaries of the variables

summary(learning19)
##     cluster          unref            deep            orga      
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.667   1st Qu.:3.750   1st Qu.:2.667  
##  Median :2.000   Median :2.000   Median :4.000   Median :3.333  
##  Mean   :1.716   Mean   :2.171   Mean   :4.007   Mean   :3.414  
##  3rd Qu.:2.000   3rd Qu.:2.667   3rd Qu.:4.500   3rd Qu.:4.000  
##  Max.   :3.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##      blocks      procrastination perfectionism   innateability  
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:2.500   1st Qu.:2.000   1st Qu.:1.000  
##  Median :2.667   Median :3.250   Median :2.333   Median :1.500  
##  Mean   :2.662   Mean   :3.219   Mean   :2.556   Mean   :1.761  
##  3rd Qu.:3.333   3rd Qu.:3.875   3rd Qu.:3.333   3rd Qu.:2.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##  ktransforming    productivity       gender      studentstatus  
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:3.667   1st Qu.:1.875   1st Qu.:1.000   1st Qu.:3.000  
##  Median :4.000   Median :2.500   Median :2.000   Median :4.000  
##  Mean   :4.041   Mean   :2.487   Mean   :1.716   Mean   :3.185  
##  3rd Qu.:4.667   3rd Qu.:3.250   3rd Qu.:2.000   3rd Qu.:4.000  
##  Max.   :5.000   Max.   :4.750   Max.   :2.000   Max.   :4.000  
##   studylength    
##  Min.   :  2.00  
##  1st Qu.:  5.00  
##  Median : 14.00  
##  Mean   : 21.63  
##  3rd Qu.: 28.50  
##  Max.   :172.00

5.2.2 Relationships between the variables

Calculate and print the correlation matrix

cor_matrix<-cor(learning19[2:10]) %>% round(digits = 2)
cor_matrix
##                 unref  deep  orga blocks procrastination perfectionism
## unref            1.00 -0.48 -0.31   0.33            0.25          0.28
## deep            -0.48  1.00  0.32  -0.27           -0.18         -0.19
## orga            -0.31  0.32  1.00  -0.22           -0.38         -0.14
## blocks           0.33 -0.27 -0.22   1.00            0.55          0.54
## procrastination  0.25 -0.18 -0.38   0.55            1.00          0.35
## perfectionism    0.28 -0.19 -0.14   0.54            0.35          1.00
## innateability    0.16 -0.11 -0.02   0.24            0.13          0.28
## ktransforming   -0.16  0.31  0.16  -0.30           -0.21         -0.25
## productivity    -0.15  0.16  0.30  -0.38           -0.46         -0.22
##                 innateability ktransforming productivity
## unref                    0.16         -0.16        -0.15
## deep                    -0.11          0.31         0.16
## orga                    -0.02          0.16         0.30
## blocks                   0.24         -0.30        -0.38
## procrastination          0.13         -0.21        -0.46
## perfectionism            0.28         -0.25        -0.22
## innateability            1.00         -0.25         0.01
## ktransforming           -0.25          1.00         0.21
## productivity             0.01          0.21         1.00

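Test the significance of the correlations and visualise the correlation matrix; correlations that are not significant at the 0.01 level are marked in the plot.

library(corrplot)
## corrplot 0.84 loaded
p.mat <- cor.mtest(learning19[2:10])$p # test on the raw data, not on the rounded correlation matrix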
corrplot(cor_matrix, method="circle", type="upper",  tl.cex = 0.6, p.mat = p.mat, sig.level = 0.01, title="Correlations of learning19", mar=c(0,0,1,0))

5.2.3 Creating an x-y scatter plot

learning19 %>%
  ggplot(aes(orga, procrastination, color = factor(cluster))) + # x = orga, y = procrastination; factor() gives one discrete colour per cluster
  geom_point()

5.3 K-means clustering

5.3.1 Calculate the distances

Euclidean distance matrix

learning19_eu <- dist(learning19[2:4])
summary(learning19_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.083   1.601   1.741   2.192   6.741

5.3.2 Determine the k

set.seed(123)
k_max <- 5 # maximum number of clusters to try
twcss <- sapply(1:k_max, function(k){kmeans(learning19[2:4], k)$tot.withinss}) # calculate the total within sum of squares
qplot(x = 1:k_max, y = twcss, geom = 'line') # visualize the results

The twcss value (total within-cluster sum of squares) decreases sharply between 2 and 5 clusters. The optimal number of clusters was 3.
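
To make the choice of k easier to read off, it can help to look at how much twcss drops with each additional cluster. The sketch below does this on the built-in iris measurements, used here only as stand-in data; the scaling step and nstart = 25 (several random starts per k, to reduce the influence of a single random initialisation) are assumptions and not part of the analysis above.

```r
set.seed(123)

# Stand-in data: four numeric columns of the built-in iris data, standardised
x <- scale(iris[, 1:4])

k_max <- 5 # maximum number of clusters to try

# Total within-cluster sum of squares for k = 1, ..., k_max
twcss <- sapply(1:k_max, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})

# Reduction in twcss gained by going from k - 1 to k clusters;
# the "elbow" is where this gain becomes small
gain <- -diff(twcss)
data.frame(k = 2:k_max, gain = round(gain, 1))
```

The same idea applies to learning19[2:4]: the k after which the gain flattens out is a candidate for the number of clusters.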

5.3.3 Perform k-means clustering

learning19_km <- kmeans(learning19[2:10], centers = 3)

Plot the dataset with clusters

pairs(learning19[2:10], col = learning19_km$cluster)

pairs(learning19[,2:4], col = learning19_km$cluster)

pairs(learning19[,5:10], col = learning19_km$cluster)

The optimal number of clusters was 3. We got the best overview with three clusters.

5.3.4 Perform k-means on the original data

library(MASS) # lda() is provided by the MASS package
learning19_scaled3 <- scale(learning19[2:4]) # standardise unref, deep and orga
learning19_km3 <- kmeans(learning19_scaled3, centers = 3)
cluster <- learning19_km3$cluster
learning19_scaled3 <- data.frame(learning19_scaled3, cluster)
lda.fit_cluster <- lda(cluster ~ ., data = learning19_scaled3) # linear discriminant analysis with the k-means cluster as the target
lda.fit_cluster

I originally loaded devtools and flipMultivariates here, but install.packages now gives the warning “package ‘flipMultivariates’ is not available”. Since lda() comes from MASS, loading MASS is enough to run the code.

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
classes3 <- as.numeric(learning19_scaled3$cluster)
plot(lda.fit_cluster, dimen = 2, col = classes3, pch = classes3, main = "LDA biplot using three clusters")
lda.arrows(lda.fit_cluster, myscale = 2)

5.3.5 3D plot

model_predictors <- dplyr::select(learning19_train, -deep2)
# check the dimensions
dim(model_predictors)
dim(lda.fit$scaling)
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

Next, install and access the plotly package.

Create a 3D plot of the columns of the matrix product.

library(plotly)
plot_ly (x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = learning19_train$deep2)
library(plot3D)

scatter3D(x = learning19$unref, y = learning19$deep, z = learning19$orga, col = NULL, 
          main = "learning19 data", xlab = "unref", # axis labels now match the plotted variables
          ylab = "deep", zlab = "orga")

library(plotly)
plot_ly (x = learning19$unref, y = learning19$deep, z = learning19$orga, type= 'scatter3d', mode='markers', color = learning19$deep)